2016 Presidential Election Contributions - California

Huey Kwik

## Parsed with column specification:
## cols(
##   cmte_id = col_character(),
##   cand_id = col_character(),
##   cand_nm = col_character(),
##   contbr_nm = col_character(),
##   contbr_city = col_character(),
##   contbr_st = col_character(),
##   contbr_zip = col_integer(),
##   contbr_employer = col_character(),
##   contbr_occupation = col_character(),
##   contb_receipt_amt = col_double(),
##   contb_receipt_dt = col_character(),
##   receipt_desc = col_character(),
##   memo_cd = col_character(),
##   memo_text = col_character(),
##   form_tp = col_character(),
##   file_num = col_integer(),
##   tran_id = col_character(),
##   election_tp = col_character()
## )
## Joining, by = "zip"

Univariate Plots

## [1] 1072770      25
## Classes 'tbl_df', 'tbl' and 'data.frame':    1072770 obs. of  25 variables:
##  $ cmte_id          : chr  "C00575795" "C00575795" "C00575795" "C00577130" ...
##  $ cand_id          : chr  "P00003392" "P00003392" "P00003392" "P60007168" ...
##  $ cand_nm          : Factor w/ 25 levels "Bush, Jeb","Carson, Benjamin S.",..: 4 4 4 20 20 20 20 4 20 20 ...
##  $ contbr_nm        : chr  "AULL, ANNE" "CARROLL, MARYJEAN" "GANDARA, DESIREE" "LEE, ALAN" ...
##  $ contbr_city      : chr  "LARKSPUR" "CAMBRIA" "FONTANA" "CAMARILLO" ...
##  $ contbr_st        : chr  "CA" "CA" "CA" "CA" ...
##  $ contbr_zip       : int  949391913 934284638 923371507 930111214 902784310 902784310 920842849 926372912 926833846 949522729 ...
##  $ contbr_employer  : chr  "N/A" "N/A" "N/A" "AT&T GOVERNMENT SOLUTIONS" ...
##  $ contbr_occupation: chr  "RETIRED" "RETIRED" "RETIRED" "SOFTWARE ENGINEER" ...
##  $ contb_receipt_amt: num  50 200 5 40 35 100 25 40 10 15 ...
##  $ contb_receipt_dt : Date, format: "2016-04-26" "2016-04-20" ...
##  $ receipt_desc     : chr  NA NA NA NA ...
##  $ memo_cd          : chr  "X" "X" "X" NA ...
##  $ memo_text        : chr  "* HILLARY VICTORY FUND" "* HILLARY VICTORY FUND" "* HILLARY VICTORY FUND" "* EARMARKED CONTRIBUTION: SEE BELOW" ...
##  $ form_tp          : chr  "SA18" "SA18" "SA18" "SA17A" ...
##  $ file_num         : int  1091718 1091718 1091718 1077404 1077404 1077404 1077404 1091718 1077404 1077404 ...
##  $ tran_id          : chr  "C4768722" "C4747242" "C4666603" "VPF7BKWA097" ...
##  $ election_tp      : Factor w/ 3 levels "G2016","P2016",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ cand_last_name   : Factor w/ 25 levels "Bush","Carson",..: 4 4 4 20 20 20 20 4 20 20 ...
##  $ party            : Ord.factor w/ 5 levels "Democratic"<"Republican"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ zip              : chr  "94939" "93428" "92337" "93011" ...
##  $ city             : chr  "Larkspur" "Cambria" "Fontana" "Camarillo" ...
##  $ state            : chr  "CA" "CA" "CA" "CA" ...
##  $ latitude         : num  37.9 35.6 34 34 33.9 ...
##  $ longitude        : num  -123 -121 -117 -119 -118 ...

Distribution of Contribution Amounts

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -10000.0     15.0     27.0    122.7    100.0  10800.0        1
## [1] "Number of negative contributions: 10617"

Observations:

  • I used log(count) so we could more easily see the data on a chart.
  • Why are there negative contributions?
  • Range of contributions range from -10000 to 10800
  • There is one missing value for contb_receipt_amt in the dataset.
  • Most contributions seem to be smaller
  • Fewer than 1% of the contributions are negative, so it’s probably okay to continue analysis and just note this. However, still curious about why.
  • Isn’t there a $2700 contribution limit for individuals? We could check if contributions greater than $2700 also have negative contributions.

Double Counting Contributions?

## # A tibble: 8 × 5
##            contbr_nm contb_receipt_dt contb_receipt_amt        tran_id
##                <chr>           <date>             <dbl>          <chr>
## 1    HOROWITZ, DAVID       2015-07-06             10800   SA17A.123258
## 2    HOROWITZ, DAVID       2015-07-06             -5400 SA17A.123258.0
## 3 HOROWITZ, MICHELLE       2015-07-06              5400 SA17A.123258.1
## 4 HOROWITZ, MICHELLE       2015-07-06             -2700 SA17A.123258.2
## 5 HOROWITZ, MICHELLE       2015-07-06              2700 SA17A.123258.3
## 6    HOROWITZ, DAVID       2015-07-06             -2700 SA17A.123258.4
## 7    HOROWITZ, DAVID       2015-07-06              2700 SA17A.123258.5
## 8 HOROWITZ, MICHELLE       2015-07-06              5400   SA17A.123263
## # ... with 1 more variables: election_tp <fctr>

I looked at some examples of contributions that were above $2700 and came across David and Michelle Horowitz. They appear to be a couple who donated to Scott Walker’s campaign.

Summing up contb_receipt_amt, we get $10,800. Is this an instance of people contributing over the limit?

From what I can tell, this instead seems to be double-counting! The FEC provides an Individual Contributor Search, which lets us look at each contributor record in more detail.

From there, I was able to piece this story:

  • On 7/6/2015, David Horowitz donates a total of $10,800 to Scott Walker’s primary campaign.
  • Of that $10,800, he reattributes $5,400 to Michelle Horowitz.
  • He redesignates $2,700 to Walker’s General Election campaign. This money gets refunded on 11/16/2015.
  • The $5,400 seems appears twice, in the 3rd and 8th row of the table.
  • Assuming that those records are duplicates, then Michelle Horowitz reattributes $2,700 to Walker’s General Election campaign.

If this story is true, then these donations are within the campaign contribution limits for primary and general elections. From an election integrity standpoint, this is good.

However, when doing analysis of this data, we should be aware of this discrepancy in our analysis. A contribution like Michelle Horowitz’s reattributed $5,400 may be double-counted in our analysis. Also, a large contribution of $10,800 by David Horowitz will count towards calculating the mean, even though it gets reattributed into smaller contributions later.

Party

Democrats had the most contributions by far, which makes sense in California.

Contribution Locations

## Joining, by = "zip"
## Warning: Removed 10 rows containing missing values (geom_point).

As you can see, there are contributions from zipcodes that are outside of California. However, since all the state fields in the data set are recorded as California, I think these might be cases where the data was entered incorrectly.

Let’s restrict our visualization to known California zipcodes:

## Joining, by = "zip"

It seems like most of the contributions are centered around the major cities in California: Los Angeles, San Francisco, San Diego, and Sacramento.

Occupations, Employers

Above, we look at hte top 10 occupations and employers in our dataset.

Univariate Analysis

What is the structure of your dataset?

There are 1,073,271 records in the dataset with 18 features.

The features are as follows:

  • committee id
  • candidate id
  • candidate name
  • contributor name
  • contributor’s city
  • contributor state
  • contributor zipcode
  • contributor’s employer
  • contributor’s occupation
  • contribution amount
  • contribution date
  • receipt description
  • memo code
  • memo text
  • form type
  • file number
  • transaction id
  • election type

Factors: Candidate name, election type (Primary 2016, General 2016, or Primary 2020)

Other observations:

  • Date range: April 1, 2015 to October 31, 2016
  • There are 60587 unique employers.
  • There are 136234 unique zipcodes.
  • There are 2418 unique cities.
  • There are 209688 unique contributors.
  • There are 26654 unique occupations.
  • There are seven donations designated for Primary 2020. They are all redesignations from general election contributions to Lindsey Graham.

What is/are the main feature(s) of interest in your dataset?

I’m mostly interested in looking at patterns/differences in contributions among different candidates. So for me, the main features of interest are candidate name, contribution amount, and date.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Zipcode, employers, occupation, and party might provide other angles into the data.

Did you create any new variables from existing variables in the dataset?

I created a Party variable to represent the political party for each canddiate.

I also merged in data from the zipcode dataset: zip, city, state, latitude, and longitude.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

When histogramming the contribution amounts, I used a log scale since one of the bins was really large. This made it easier to see the rest of the data.

Bivariate Plots

Contributions per Candidate

##  contb_receipt_amt 
##  Min.   :-5700.00  
##  1st Qu.:   15.00  
##  Median :   27.00  
##  Mean   :   50.57  
##  3rd Qu.:   50.00  
##  Max.   :10000.00
##  contb_receipt_amt
##  Min.   :-5400.0  
##  1st Qu.:   15.0  
##  Median :   25.0  
##  Mean   :  146.3  
##  3rd Qu.:  100.0  
##  Max.   : 7300.0  
##  NA's   :1

Since Sanders is often portrayed as the more progressive, blue-collar candidate, it is interesting to see that Clinton’s median donation is actually lower. It is interesting that Clinton and Sanders average donation amounts are roughly the same. Clinton’s median donation is actually lower, i.e. $27 vs. $25. Of course, this data does not include donations to Political Action Committees, so that could be a factor.

For the box plots, I sorted the candidates from highest number of donations to lowest.

In the first box plot, we can see that some candidates actually have many donations above the individual limit of $2700. Many also have negative donations, which could either be refunds or reattributions.

In the second box plot, I excluded negative contributions to see if we could see any other patterns.

Life of a Campaign

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Observations:

  • Campaign donations start in early 2015 but really pick up in 2016, which coincides with primary season.
  • The Republican field was crowded. Trump had a huge spike in donations in the middle of 2016, perhaps when it was clear he was going to win the nomination.
  • Sanders vs. Clinton is interesting to look at. It looks like Sanders actually gets more donations during mcuh of the primary season. The decline in donations also corresponds to roughly when it becomes apparent that Sanders will lose the nomination.
  • Evan McMullin’s donations picked up in the final month of the campaign.

Donation vs. Duration of Campaign

Last donation date could be a proxy variable for how long a campaign lasts. As we can see, this is positively correlated with the total number of donations.

ggplot(data = contributions.candidates, aes(x = total, y = n)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

primaries <- inner_join(contributions.candidates, votes, by = c("cand_last_name" = "last_name"))
## Warning in inner_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector
ggplot(data = primaries, aes(x = n, y = delegates)) + 
  geom_point() +
  scale_x_log10() + 
  facet_wrap(~party)

ggplot(data = primaries, aes(x = n, y = votes)) + 
  geom_point() +
  scale_x_log10() + 
  scale_y_log10() +
  facet_wrap(~party)

There seems to be a positive correlation between the log number of votes and the log number of donations.

Contributions by Party and Location

This visualization again shows concentration of donation activity around cities, but also shows more domination by Democrats.

Party vs. Occupation

Candidate vs. Occupation

Over 60% of donations to Bernie Sanders were from people whose occupation was “Not Employed”.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Number of contributions and total contributions are positively correlated.

Both of these features are positively correlated with number of votes and number of delegates in the primary election.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What was the strongest relationship you found?

Multivariate Plots Section

Number of contributions + Party + Location

# Group by city to consolidate zipcodes.
contributions.city.party <- contributions.ca %>%
  group_by(city, party) %>% 
  summarise(
    total = sum(contb_receipt_amt),
    mean = mean(contb_receipt_amt),
    median = median(contb_receipt_amt),
    latitude = mean(latitude),
    longitude = mean(longitude),
    n = n())

ca_base <- ggplot() + 
  geom_polygon(data = ca_df, aes(x = long, y = lat, group = group), fill=NA, color="black") + 
  coord_fixed(1.3) +
  scale_size(range = c(0, 6))

# All five parties at once.
ca_base +
  geom_point(data = contributions.city.party, aes(x = longitude, y = latitude, color = party, size = n), alpha = 1/2) +
  scale_color_manual(values = party.colors)

ca_base + 
  geom_point(data = contributions.city.party, aes(x = longitude, y = latitude, size = total, color = party), alpha = 1/2) +
  scale_color_manual(values = c("#3333FF", "#EE3523", "#FED105", "#17aa5c", "#DDDDDD"))
## Warning: Removed 1 rows containing missing values (geom_point).

# Just Democratic.
ca_base +
  geom_point(data = contributions.city.party %>% filter(party == "Democratic"), 
             aes(x = longitude, y = latitude, size = n), 
             color = party.colors[1],
             alpha = 1/2) 

# Just Republican.
ca_base + 
  geom_point(data = contributions.city.party %>% filter(party == "Republican"), 
             aes(x = longitude, y = latitude, size = n), 
             color = party.colors[2],
             alpha = 1/2) 

# Number of contributions, facet by party.
ca_base + 
  geom_point(data = contributions.city.party, aes(x = longitude, y = latitude, size = n, color=party), alpha = 1/2) +
  scale_color_manual(values = party.colors) +
  facet_wrap(~party)

# Total of contributions, facet by party.
ca_base + 
  geom_point(data = contributions.city.party, aes(x = longitude, y = latitude, size = total, color=party), alpha = 1/2) +
  scale_color_manual(values = party.colors) +
  facet_wrap(~party)
## Warning: Removed 1 rows containing missing values (geom_point).

Where are Clinton’s Supporters? Bernie’s Supporters?

contributions.city.cand <- contributions.ca %>%
  filter(election_tp == "P2016") %>%
  group_by(city, cand_last_name) %>% 
  summarise(
    total = sum(contb_receipt_amt),
    mean = mean(contb_receipt_amt),
    median = median(contb_receipt_amt),
    latitude = mean(latitude),
    longitude = mean(longitude),
    n = n())

dems = c("Clinton", "Sanders")
ca_base + 
  geom_point(data = contributions.city.cand %>% filter(cand_last_name %in% dems), aes(x = longitude, y = latitude, size = total, color = cand_last_name))

ca_base + 
  geom_point(data = contributions.city.cand %>% filter(cand_last_name %in% dems), aes(x = longitude, y = latitude, size = total, color = cand_last_name)) +
  facet_wrap(~cand_last_name)

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The Democratic and Republican parties receive donations from similar areas, with the Democrats receiving more donations from more largely populated areas.

Were there any interesting or surprising interactions between features?

I expected to see some difference between Clinton and Sanders support geographically.


Final Plots and Summary

Plot One

Description One

Plot Two

Description Two

Plot Three

Description Three


Reflection